I. Introduction




II. Data source


Collection

Chao Yin is mainly responsible for collecting team and player game statistics, while Zeyu Yang is responsible for players’ biographical and salary information.

Our data is collected from Basketball-Reference, Stats NBA and Kaggle.

  • Basketball Reference is a site that provides both basic and sabermetric statistics and resources for basketball fans, using official NBA data.

  • Stats NBA is the home of NBA Advanced Stats and provides official NBA Statistics and advanced analytics.

  • Kaggle is an online community that allows users to find and publish data sets.

Data on Basketball-Reference is stored in XML, so we can extract it directly using the XML and RCurl packages. However, some tables on this site are commented out and can only be downloaded manually in CSV form, so we chose Stats NBA for the remaining data. Extracting tables from Stats NBA is a bit harder than from Basketball-Reference because they are stored in JSON files. We use statsnbaR, which provides utility functions for downloading data from the Stats NBA API endpoints. We got teams from Basketball-Reference and players from Stats NBA.
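To illustrate the "commented table" problem, here is a minimal sketch on a made-up HTML snippet (not real Basketball-Reference markup). It shows one common workaround, stripping the comment delimiters before parsing. This is not the manual CSV download described above, and the helper name `uncomment_html` is hypothetical.

```r
# Toy HTML: the second table sits inside an HTML comment, so a parser skips it.
html <- paste0(
  "<html><body>",
  "<table id='visible'><tr><th>Team</th></tr><tr><td>HOU</td></tr></table>",
  "<!-- <table id='hidden'><tr><th>Team</th></tr><tr><td>CLE</td></tr></table> -->",
  "</body></html>"
)

# Hypothetical helper: drop the comment delimiters so the hidden table becomes markup.
uncomment_html <- function(x) gsub("<!--|-->", "", x)

clean_html <- uncomment_html(html)

# Parse the cleaned page only if the XML package is installed.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::htmlParse(clean_html, asText = TRUE)
  tables <- XML::readHTMLTable(doc, stringsAsFactors = FALSE)
  print(length(tables))
}
```

Removing every comment marker is blunt and could expose markup that was meant to stay hidden, so a real scraper would probably target only the comments that contain tables.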

Kaggle is the source of players’ biographical data. The two sites mentioned above also provide the same data, but it is harder to collect there since it is not stored in tables.

Datasets and variables

The players dataset contains regular-season information for every player in each season.


General data describes players’ basic performance, including:

  • Profile information like Name, Team, Age, Game Played, Minutes Played, etc.

  • Shooting performance on 2-pointers, 3-pointers and free throws, like Field Goals Made, Field Goals Attempted, Field Goal Percentage, etc.

  • Basic stats per game like Rebounds, Assists, Steals, Blocks, Points, Turnovers, Personal Fouls, etc.


Advanced data measures and analyzes a player’s ability in one specific area:

  • Overall ratings like Offensive Rating, Defensive Rating, Net Rating, Player Impact Estimate, Usage Percentage, etc.

  • Passing/Assist ability like Assist Percentage, Assist to Turnover Ratio, Assist Ratio

  • Rebound ability like Offensive Rebound Percentage, Defensive Rebound Percentage, Rebound Percentage

  • Shooting ability like Effective Field Goal Percentage, True Shooting Percentage


The bio dataset contains players’ biographical data:

  • The year the player started playing in the NBA and the year he retired

  • Height and weight data

  • Birth date

  • College attended



The teams dataset contains information similar to that in players, but for each team in the league. In addition, teams offers ways to split the data in order to measure teams’ performance from different angles:

  • Location helps measure teams’ performance at home and on the road separately

  • Wins-Losses tells how the team played when it won or lost the game

  • Month and Pre/Post All Stars show how teams’ performance changes over time

  • Days Rest tests teams’ ability to handle tough schedules

Issues/Problems

NBA teams have changed over these 15 years. Three teams changed their location and name, so a team is not necessarily the same from year to year. Players can also be traded or signed during the season, which gives some players more records than others in these datasets.
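The extra records for traded players can be handled in several ways. The sketch below shows one option on toy data, assuming (as on Basketball-Reference) that the team code "TOT" marks a player’s combined season totals. The player names and numbers are invented, and this is not necessarily how the datasets here were deduplicated.

```r
# Toy data: player A was traded mid-season, player B was not.
players <- data.frame(
  Player = c("A", "A", "A", "B"),
  year   = c(2019, 2019, 2019, 2019),
  Tm     = c("TOT", "HOU", "CLE", "NYK"),
  G      = c(60, 25, 35, 70),
  stringsAsFactors = FALSE
)

# Flag every player-season that has a TOT row.
key <- paste(players$Player, players$year)
has_tot <- key %in% key[players$Tm == "TOT"]

# Keep the TOT row when it exists, otherwise keep the single team row.
one_row <- players[!has_tot | players$Tm == "TOT", ]
print(one_row)
```

This leaves exactly one record per player and season, which matters for any per-player average.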

Height in the bio dataset is saved as character, such as “6-8”, which requires us to convert it to numeric.
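A minimal sketch of such a conversion follows. It assumes the strings are feet and inches and that the target unit is centimetres, which is consistent with the height range of 165.1 to 228.6 reported in the Players_bio summary. The function name `height_to_cm` is hypothetical.

```r
# Convert "feet-inches" strings such as "6-8" to centimetres.
height_to_cm <- function(x) {
  parts <- strsplit(as.character(x), "-", fixed = TRUE)
  vapply(parts, function(p) {
    # Anything that is not exactly "feet-inches" becomes NA.
    if (length(p) != 2 || anyNA(p)) return(NA_real_)
    (as.numeric(p[1]) * 12 + as.numeric(p[2])) * 2.54
  }, numeric(1))
}

heights <- height_to_cm(c("6-8", "5-5", "7-6", NA))
print(heights)
```

Wrapping the input in `as.character()` means the same function also works when the column was read in as a factor.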

Also, all data are saved as factors, which requires us to convert them to numeric or character.
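The factor conversion has a well-known pitfall in R, shown here on a toy vector: calling `as.numeric()` directly on a factor returns the internal level codes, so the values have to pass through `as.character()` first.

```r
f <- factor(c("26", "31", "19"))

# Direct conversion returns the level codes, not the values.
codes <- as.numeric(f)

# Going through character first recovers the original numbers.
values <- as.numeric(as.character(f))

print(codes)
print(values)
```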



III. Data cleaning


After collecting all the raw data in data/raw, we combined them into four datasets: Team_splits, Team_shooting, Player and Players_bio.

For the players’ data, we first remove empty rows and columns and convert the variables to numeric or character according to their content. Because more and more players today can play more than one position, we group the players into three kinds (Guards, Wings and Bigs) instead of using their original positions. Finally, we combine the players data of all 15 years to get Player.
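The exact position mapping is not spelled out here, so the one below is an assumption: PG and SG as Guards, SF as Wings, PF and C as Bigs, which roughly matches the group frequencies in the summary tables. Treating the first listed position of a hybrid such as "C-PF" as the primary one is also an assumption, and `group_position` is a hypothetical helper.

```r
# Assumed mapping from original positions to the three groups.
pos_group <- c(PG = "Guards", SG = "Guards", SF = "Wings", PF = "Bigs", C = "Bigs")

group_position <- function(pos) {
  # Use the first listed position for hybrids, e.g. "C-PF" -> "C".
  primary <- sub("-.*$", "", as.character(pos))
  factor(unname(pos_group[primary]), levels = c("Guards", "Wings", "Bigs"))
}

grouped <- group_position(c("PG", "SF", "C-PF", "SG-SF"))
print(grouped)
```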

Scroll down the table to see more details

print(dfSummary(Player,headings = FALSE,plain.ascii = FALSE,valid.col = FALSE,graph.magnif = 0.75,style = "grid"),max.tbl.height = 500,method='render')
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 Player [character] 1. Corey Brewer: 22 (0.2%); 2. Kyle Korver: 22 (0.2%); 3. Andre Miller: 21 (0.2%); 4. Devin Harris: 21 (0.2%); 5. Mike James: 21 (0.2%); 6. Nazr Mohammed: 21 (0.2%); 7. Trevor Ariza: 21 (0.2%); 8. Drew Gooden: 20 (0.2%); 9. Pau Gasol: 20 (0.2%); 10. Shaun Livingston: 20 (0.2%); [ 1636 others ]: 9351 (97.8%) 0 (0%)
2 Pos [factor] 1. Guards: 3899 (40.8%); 2. Wings: 1798 (18.8%); 3. Bigs: 3863 (40.4%) 0 (0%)
3 Age [numeric] Mean (sd) : 26.7 (4.2) min < med < max: 18 < 26 < 44 IQR (CV) : 7 (0.2) 26 distinct values 0 (0%)
4 Tm [character] 1. TOT: 964 (10.1%); 2. HOU: 315 (3.3%); 3. CLE: 314 (3.3%); 4. MEM: 306 (3.2%); 5. NYK: 303 (3.2%); 6. LAC: 296 (3.1%); 7. PHI: 294 (3.1%); 8. WAS: 293 (3.1%); 9. DAL: 291 (3.0%); 10. MIL: 291 (3.0%); [ 26 others ]: 5893 (61.6%) 0 (0%)
5 G [numeric] Mean (sd) : 46.8 (26.2) min < med < max: 1 < 51 < 85 IQR (CV) : 49 (0.6) 85 distinct values 0 (0%)
6 GS [numeric] Mean (sd) : 22.1 (27.4) min < med < max: 0 < 7 < 83 IQR (CV) : 40 (1.2) 84 distinct values 0 (0%)
7 MP [numeric] Mean (sd) : 19.6 (9.9) min < med < max: 0 < 19 < 43.1 IQR (CV) : 16 (0.5) 410 distinct values 0 (0%)
8 FG [numeric] Mean (sd) : 2.9 (2.1) min < med < max: 0 < 2.4 < 12.2 IQR (CV) : 2.8 (0.7) 111 distinct values 0 (0%)
9 FGA [numeric] Mean (sd) : 6.6 (4.5) min < med < max: 0 < 5.5 < 27.2 IQR (CV) : 6.1 (0.7) 228 distinct values 0 (0%)
10 FG% [numeric] Mean (sd) : 0.4 (0.1) min < med < max: 0 < 0.4 < 1 IQR (CV) : 0.1 (0.2) 458 distinct values 52 (0.54%)
11 3P [numeric] Mean (sd) : 0.6 (0.7) min < med < max: 0 < 0.3 < 5.1 IQR (CV) : 1 (1.2) 44 distinct values 0 (0%)
12 3PA [numeric] Mean (sd) : 1.7 (1.8) min < med < max: 0 < 1.1 < 13.2 IQR (CV) : 2.7 (1.1) 95 distinct values 0 (0%)
13 3P% [numeric] Mean (sd) : 0.3 (0.2) min < med < max: 0 < 0.3 < 1 IQR (CV) : 0.2 (0.6) 380 distinct values 1467 (15.35%)
14 2P [numeric] Mean (sd) : 2.3 (1.8) min < med < max: 0 < 1.8 < 10.3 IQR (CV) : 2.3 (0.8) 99 distinct values 0 (0%)
15 2PA [numeric] Mean (sd) : 4.9 (3.6) min < med < max: 0 < 3.9 < 22.2 IQR (CV) : 4.7 (0.7) 198 distinct values 0 (0%)
16 2P% [numeric] Mean (sd) : 0.5 (0.1) min < med < max: 0 < 0.5 < 1 IQR (CV) : 0.1 (0.2) 446 distinct values 95 (0.99%)
17 eFG% [numeric] Mean (sd) : 0.5 (0.1) min < med < max: 0 < 0.5 < 1.5 IQR (CV) : 0.1 (0.2) 473 distinct values 52 (0.54%)
18 FT [numeric] Mean (sd) : 1.4 (1.4) min < med < max: 0 < 1 < 10.3 IQR (CV) : 1.4 (1) 92 distinct values 0 (0%)
19 FTA [numeric] Mean (sd) : 1.9 (1.7) min < med < max: 0 < 1.4 < 11.7 IQR (CV) : 1.8 (0.9) 112 distinct values 0 (0%)
20 FT% [numeric] Mean (sd) : 0.7 (0.2) min < med < max: 0 < 0.8 < 1 IQR (CV) : 0.2 (0.2) 582 distinct values 456 (4.77%)
21 ORB [numeric] Mean (sd) : 0.9 (0.8) min < med < max: 0 < 0.6 < 6 IQR (CV) : 0.9 (0.9) 54 distinct values 0 (0%)
22 DRB [numeric] Mean (sd) : 2.5 (1.8) min < med < max: 0 < 2.1 < 12 IQR (CV) : 2 (0.7) 111 distinct values 0 (0%)
23 TRB [numeric] Mean (sd) : 3.4 (2.4) min < med < max: 0 < 2.8 < 18 IQR (CV) : 2.8 (0.7) 148 distinct values 0 (0%)
24 AST [numeric] Mean (sd) : 1.7 (1.7) min < med < max: 0 < 1.2 < 12.8 IQR (CV) : 1.8 (1) 114 distinct values 0 (0%)
25 STL [numeric] Mean (sd) : 0.6 (0.4) min < med < max: 0 < 0.5 < 2.9 IQR (CV) : 0.5 (0.7) 30 distinct values 0 (0%)
26 BLK [numeric] Mean (sd) : 0.4 (0.5) min < med < max: 0 < 0.2 < 6 IQR (CV) : 0.4 (1.2) 39 distinct values 0 (0%)
27 TOV [numeric] Mean (sd) : 1.1 (0.8) min < med < max: 0 < 1 < 5.7 IQR (CV) : 0.9 (0.7) 51 distinct values 0 (0%)
28 PF [numeric] Mean (sd) : 1.8 (0.8) min < med < max: 0 < 1.8 < 6 IQR (CV) : 1.1 (0.5) 46 distinct values 0 (0%)
29 PTS [numeric] Mean (sd) : 7.8 (5.8) min < med < max: 0 < 6.4 < 36.1 IQR (CV) : 7.7 (0.7) 301 distinct values 0 (0%)
30 PER [numeric] Mean (sd) : 12.6 (6) min < med < max: -54.4 < 12.5 < 133.8 IQR (CV) : 5.9 (0.5) 412 distinct values 3 (0.03%)
31 TS% [numeric] Mean (sd) : 0.5 (0.1) min < med < max: 0 < 0.5 < 1.5 IQR (CV) : 0.1 (0.2) 481 distinct values 25 (0.26%)
32 3PAr [numeric] Mean (sd) : 0.2 (0.2) min < med < max: 0 < 0.2 < 1 IQR (CV) : 0.4 (0.9) 784 distinct values 26 (0.27%)
33 FTr [numeric] Mean (sd) : 0.3 (0.2) min < med < max: 0 < 0.3 < 6 IQR (CV) : 0.2 (0.7) 778 distinct values 26 (0.27%)
34 ORB% [numeric] Mean (sd) : 5.5 (4.7) min < med < max: 0 < 4 < 100 IQR (CV) : 6.2 (0.9) 222 distinct values 3 (0.03%)
35 DRB% [numeric] Mean (sd) : 14.5 (6.5) min < med < max: 0 < 13.4 < 100 IQR (CV) : 8.5 (0.4) 354 distinct values 3 (0.03%)
36 TRB% [numeric] Mean (sd) : 10 (5) min < med < max: 0 < 8.9 < 86.4 IQR (CV) : 7.2 (0.5) 265 distinct values 3 (0.03%)
37 AST% [numeric] Mean (sd) : 12.7 (9.2) min < med < max: 0 < 9.8 < 78.5 IQR (CV) : 11 (0.7) 470 distinct values 3 (0.03%)
38 STL% [numeric] Mean (sd) : 1.6 (0.9) min < med < max: 0 < 1.5 < 12.5 IQR (CV) : 0.8 (0.5) 80 distinct values 3 (0.03%)
39 BLK% [numeric] Mean (sd) : 1.6 (1.7) min < med < max: 0 < 1 < 26.3 IQR (CV) : 1.8 (1.1) 109 distinct values 3 (0.03%)
40 TOV% [numeric] Mean (sd) : 14 (6.1) min < med < max: 0 < 13.2 < 100 IQR (CV) : 5.9 (0.4) 341 distinct values 21 (0.22%)
41 USG% [numeric] Mean (sd) : 18.6 (5.2) min < med < max: 0 < 18.2 < 53.7 IQR (CV) : 6.8 (0.3) 334 distinct values 3 (0.03%)
42 OWS [numeric] Mean (sd) : 1.2 (2) min < med < max: -3.3 < 0.5 < 14.8 IQR (CV) : 1.9 (1.6) 156 distinct values 0 (0%)
43 DWS [numeric] Mean (sd) : 1.2 (1.1) min < med < max: -0.6 < 0.9 < 9.1 IQR (CV) : 1.4 (1) 80 distinct values 0 (0%)
44 WS [numeric] Mean (sd) : 2.4 (2.8) min < med < max: -2.1 < 1.5 < 20.3 IQR (CV) : 3.5 (1.2) 184 distinct values 0 (0%)
45 WS/48 [numeric] Mean (sd) : 0.1 (0.1) min < med < max: -1.3 < 0.1 < 2.7 IQR (CV) : 0.1 (1.4) 557 distinct values 3 (0.03%)
46 OBPM [numeric] Mean (sd) : -1.7 (3.5) min < med < max: -46.4 < -1.5 < 68.6 IQR (CV) : 3.4 (-2.1) 283 distinct values 0 (0%)
47 DBPM [numeric] Mean (sd) : -0.5 (2.1) min < med < max: -23.1 < -0.5 < 17.1 IQR (CV) : 2.5 (-4.4) 185 distinct values 0 (0%)
48 BPM [numeric] Mean (sd) : -2.1 (4.2) min < med < max: -59 < -1.8 < 54.4 IQR (CV) : 4.2 (-2) 334 distinct values 0 (0%)
49 VORP [numeric] Mean (sd) : 0.5 (1.3) min < med < max: -2.2 < 0 < 12.4 IQR (CV) : 1.1 (2.4) 112 distinct values 0 (0%)
50 year [integer] Mean (sd) : 2011.7 (4.7) min < med < max: 2004 < 2012 < 2019 IQR (CV) : 8 (0) 16 distinct values 0 (0%)

For Players_bio data, we join players’ data and biographical data and turn the variables into numerics and characters according to their content.

Scroll down the table to see more details

print(dfSummary(Players_bio,headings = FALSE,plain.ascii = FALSE,valid.col = FALSE,graph.magnif = 0.75,style = "grid"),max.tbl.height = 500,method='render')
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 Rk [numeric] Mean (sd) : 239.6 (137.8) min < med < max: 1 < 239 < 540 IQR (CV) : 238 (0.6) 540 distinct values 0 (0%)
2 Player [character] 1. Mike James: 42 (0.4%); 2. Mike Dunleavy: 36 (0.4%); 3. Chris Johnson: 28 (0.3%); 4. David Lee: 28 (0.3%); 5. Corey Brewer: 22 (0.2%); 6. Kyle Korver: 22 (0.2%); 7. Andre Miller: 21 (0.2%); 8. Devin Harris: 21 (0.2%); 9. Nazr Mohammed: 21 (0.2%); 10. Trevor Ariza: 21 (0.2%); [ 1636 others ]: 9460 (97.3%) 0 (0%)
3 Pos [character] 1. SG: 1984 (20.4%); 2. PF: 1973 (20.3%); 3. PG: 1945 (20.0%); 4. C: 1886 (19.4%); 5. SF: 1778 (18.3%); 6. C-PF: 23 (0.2%); 7. PG-SG: 21 (0.2%); 8. SF-SG: 20 (0.2%); 9. PF-SF: 19 (0.2%); 10. SG-SF: 19 (0.2%); [ 5 others ]: 54 (0.6%) 0 (0%)
4 Age [numeric] Mean (sd) : 26.6 (4.2) min < med < max: 18 < 26 < 44 IQR (CV) : 7 (0.2) 26 distinct values 0 (0%)
5 Tm [character] 1. TOT: 986 (10.1%); 2. HOU: 321 (3.3%); 3. CLE: 319 (3.3%); 4. NYK: 313 (3.2%); 5. MEM: 309 (3.2%); 6. PHI: 301 (3.1%); 7. LAC: 299 (3.1%); 8. MIL: 299 (3.1%); 9. WAS: 299 (3.1%); 10. DAL: 296 (3.0%); [ 26 others ]: 5980 (61.5%) 0 (0%)
6 G [numeric] Mean (sd) : 46.6 (26.3) min < med < max: 1 < 51 < 85 IQR (CV) : 49 (0.6) 85 distinct values 0 (0%)
7 GS [numeric] Mean (sd) : 21.9 (27.4) min < med < max: 0 < 7 < 83 IQR (CV) : 40 (1.2) 84 distinct values 0 (0%)
8 MP [numeric] Mean (sd) : 1078 (877.4) min < med < max: 0 < 887 < 3424 IQR (CV) : 1483 (0.8) 2828 distinct values 0 (0%)
9 FG [numeric] Mean (sd) : 166.1 (164.4) min < med < max: 0 < 114 < 978 IQR (CV) : 225 (1) 727 distinct values 0 (0%)
10 FGA [numeric] Mean (sd) : 366.9 (352.8) min < med < max: 0 < 260 < 2173 IQR (CV) : 489 (1) 1370 distinct values 0 (0%)
11 FG% [numeric] Mean (sd) : 0.4 (0.1) min < med < max: 0 < 0.4 < 1 IQR (CV) : 0.1 (0.2) 458 distinct values 53 (0.55%)
12 3P [numeric] Mean (sd) : 33.1 (46.4) min < med < max: 0 < 10 < 402 IQR (CV) : 52 (1.4) 249 distinct values 0 (0%)
13 3PA [numeric] Mean (sd) : 93 (122.9) min < med < max: 0 < 34 < 1028 IQR (CV) : 146 (1.3) 552 distinct values 0 (0%)
14 3P% [numeric] Mean (sd) : 0.3 (0.2) min < med < max: 0 < 0.3 < 1 IQR (CV) : 0.2 (0.6) 380 distinct values 1490 (15.33%)
15 2P [numeric] Mean (sd) : 133 (140.6) min < med < max: 0 < 85 < 798 IQR (CV) : 174 (1.1) 644 distinct values 0 (0%)
16 2PA [numeric] Mean (sd) : 273.9 (280.7) min < med < max: 0 < 182 < 1655 IQR (CV) : 350 (1) 1140 distinct values 0 (0%)
17 2P% [numeric] Mean (sd) : 0.5 (0.1) min < med < max: 0 < 0.5 < 1 IQR (CV) : 0.1 (0.2) 446 distinct values 98 (1.01%)
18 eFG% [numeric] Mean (sd) : 0.5 (0.1) min < med < max: 0 < 0.5 < 1.5 IQR (CV) : 0.1 (0.2) 473 distinct values 53 (0.55%)
19 FT [numeric] Mean (sd) : 80.1 (98.8) min < med < max: 0 < 44 < 756 IQR (CV) : 99 (1.2) 515 distinct values 0 (0%)
20 FTA [numeric] Mean (sd) : 105.7 (125.1) min < med < max: 0 < 61 < 916 IQR (CV) : 129 (1.2) 615 distinct values 0 (0%)
21 FT% [numeric] Mean (sd) : 0.7 (0.2) min < med < max: 0 < 0.8 < 1 IQR (CV) : 0.2 (0.2) 582 distinct values 475 (4.89%)
22 ORB [numeric] Mean (sd) : 48.4 (57.1) min < med < max: 0 < 28 < 440 IQR (CV) : 56 (1.2) 310 distinct values 0 (0%)
23 DRB [numeric] Mean (sd) : 139.5 (137.5) min < med < max: 0 < 102 < 894 IQR (CV) : 174 (1) 650 distinct values 0 (0%)
24 TRB [numeric] Mean (sd) : 187.8 (188.3) min < med < max: 0 < 133 < 1247 IQR (CV) : 228 (1) 837 distinct values 0 (0%)
25 AST [numeric] Mean (sd) : 97.4 (123.3) min < med < max: 0 < 53 < 925 IQR (CV) : 117 (1.3) 610 distinct values 0 (0%)
26 STL [numeric] Mean (sd) : 33.6 (32.5) min < med < max: 0 < 24 < 217 IQR (CV) : 44 (1) 179 distinct values 0 (0%)
27 BLK [numeric] Mean (sd) : 21.2 (30.4) min < med < max: 0 < 10 < 307 IQR (CV) : 23 (1.4) 208 distinct values 0 (0%)
28 TOV [numeric] Mean (sd) : 61.4 (59.5) min < med < max: 0 < 44 < 464 IQR (CV) : 78 (1) 304 distinct values 0 (0%)
29 PF [numeric] Mean (sd) : 93.3 (71.2) min < med < max: 0 < 83 < 332 IQR (CV) : 117 (0.8) 304 distinct values 0 (0%)
30 PTS [numeric] Mean (sd) : 445.5 (448.9) min < med < max: 0 < 303 < 2832 IQR (CV) : 600 (1) 1656 distinct values 0 (0%)
31 Year [numeric] Mean (sd) : 2011.7 (4.7) min < med < max: 2004 < 2012 < 2019 IQR (CV) : 8 (0) 16 distinct values 0 (0%)
32 year_start [integer] Mean (sd) : 2006.5 (6.7) min < med < max: 1952 < 2007 < 2018 IQR (CV) : 10 (0) 44 distinct values 275 (2.83%)
33 year_end [integer] Mean (sd) : 2014.3 (5) min < med < max: 1958 < 2016 < 2018 IQR (CV) : 6 (0) 31 distinct values 275 (2.83%)
34 position [character] 1. C: 1064 (11.3%); 2. C-F: 328 (3.5%); 3. F: 2746 (29.1%); 4. F-C: 795 (8.4%); 5. F-G: 359 (3.8%); 6. G: 3410 (36.1%); 7. G-F: 745 (7.9%) 275 (2.83%)
35 height [numeric] Mean (sd) : 200.6 (9.1) min < med < max: 165.1 < 200.7 < 228.6 IQR (CV) : 15.2 (0) 22 distinct values 275 (2.83%)
36 weight [integer] Mean (sd) : 219.8 (26.9) min < med < max: 135 < 220 < 360 IQR (CV) : 40 (0.1) 120 distinct values 275 (2.83%)
37 birth_date [character] 1. June 26, 1984: 45 (0.5%); 2. June 1, 1985: 35 (0.4%); 3. March 25, 1986: 29 (0.3%); 4. May 19, 1976: 28 (0.3%); 5. August 17, 1986: 27 (0.3%); 6. March 5, 1986: 26 (0.3%); 7. December 2, 1978: 25 (0.3%); 8. September 28, 1982: 25 (0.3%); 9. October 26, 1985: 24 (0.3%); 10. April 1, 1988: 22 (0.2%); [ 1406 others ]: 9161 (97.0%) 275 (2.83%)
38 college [character] 1. (empty): 1582 (16.7%); 2. University of Kentucky: 326 (3.5%); 3. Duke University: 287 (3.0%); 4. University of North Carol: 268 (2.8%); 5. University of California,: 242 (2.6%); 6. University of Kansas: 229 (2.4%); 7. University of Arizona: 212 (2.2%); 8. University of Connecticut: 184 (1.9%); 9. University of Florida: 162 (1.7%); 10. University of Texas at Au: 158 (1.7%); [ 224 others ]: 5797 (61.4%) 275 (2.83%)

For teams’ data, we split them into two datasets: Team_splits and Team_shooting.

Team_splits contains all the ‘per game’ stats for each of the 30 teams in every season. We chose the ‘Location’ split because every team plays 41 home games and 41 road games each year, so we can simply take the mean of the two splits to get season-average stats. We changed the format, removed the ranking variables, combined the basic and advanced data, and put all 15 years of data into this one dataset.
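The simple mean works because the two splits cover the same number of games. The sketch below uses invented numbers for a hypothetical "Team A" and assumed column names (`location`, `games`) to compare the simple mean with the games-weighted mean, which is the general form.

```r
# Toy split data for one team-season; the numbers are illustrative only.
splits <- data.frame(
  team     = c("Team A", "Team A"),
  year     = c(2019, 2019),
  location = c("Home", "Road"),
  games    = c(41, 41),
  pts      = c(112, 108)
)

# Simple mean of the two split rows.
simple <- aggregate(pts ~ team + year, data = splits, FUN = mean)

# Games-weighted mean; equal to the simple mean when game counts match.
weighted <- sum(splits$pts * splits$games) / sum(splits$games)

print(simple$pts)
print(weighted)
```

For a split with unequal group sizes, such as Wins-Losses, only the weighted version would reproduce the season average.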

Scroll down the table to see more details

print(dfSummary(Team_splits,headings = FALSE,plain.ascii = FALSE,valid.col = FALSE,graph.magnif = 0.75,style = "grid"),max.tbl.height = 500,method='render')
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 team [character] 1. Atlanta Hawks: 16 (3.3%); 2. Boston Celtics: 16 (3.3%); 3. Chicago Bulls: 16 (3.3%); 4. Cleveland Cavaliers: 16 (3.3%); 5. Dallas Mavericks: 16 (3.3%); 6. Denver Nuggets: 16 (3.3%); 7. Detroit Pistons: 16 (3.3%); 8. Golden State Warriors: 16 (3.3%); 9. Houston Rockets: 16 (3.3%); 10. Indiana Pacers: 16 (3.3%); [ 26 others ]: 319 (66.6%) 0 (0%)
2 pctWins [numeric] Mean (sd) : 0.5 (0.2) min < med < max: 0.1 < 0.5 < 0.9 IQR (CV) : 0.2 (0.3) 115 distinct values 0 (0%)
3 fgm [numeric] Mean (sd) : 37.5 (2.1) min < med < max: 32.4 < 37.3 < 44 IQR (CV) : 2.7 (0.1) 168 distinct values 0 (0%)
4 fga [numeric] Mean (sd) : 82.5 (3.6) min < med < max: 74.2 < 82.2 < 94 IQR (CV) : 5.1 (0) 220 distinct values 0 (0%)
5 pctFG [numeric] Mean (sd) : 0.5 (0) min < med < max: 0.4 < 0.5 < 0.5 IQR (CV) : 0 (0) 125 distinct values 0 (0%)
6 fg3m [numeric] Mean (sd) : 7.4 (2.3) min < med < max: 2.8 < 7 < 16.1 IQR (CV) : 3 (0.3) 164 distinct values 0 (0%)
7 fg3a [numeric] Mean (sd) : 20.7 (6.1) min < med < max: 8.2 < 19.5 < 45.3 IQR (CV) : 8.3 (0.3) 294 distinct values 0 (0%)
8 pctFG3 [numeric] Mean (sd) : 0.4 (0) min < med < max: 0.3 < 0.4 < 0.4 IQR (CV) : 0 (0.1) 478 distinct values 0 (0%)
9 pctFT [numeric] Mean (sd) : 0.8 (0) min < med < max: 0.7 < 0.8 < 0.8 IQR (CV) : 0 (0) 206 distinct values 0 (0%)
10 fg2m [numeric] Mean (sd) : 30.1 (1.9) min < med < max: 23.1 < 30.2 < 35.2 IQR (CV) : 2.4 (0.1) 151 distinct values 0 (0%)
11 fg2a [numeric] Mean (sd) : 61.8 (4.6) min < med < max: 41.9 < 62.1 < 74.3 IQR (CV) : 6.1 (0.1) 253 distinct values 0 (0%)
12 pctFG2 [numeric] Mean (sd) : 0.5 (0) min < med < max: 0.4 < 0.5 < 0.6 IQR (CV) : 0 (0) 479 distinct values 0 (0%)
13 ftm [numeric] Mean (sd) : 18.2 (2) min < med < max: 12.2 < 18.1 < 24.1 IQR (CV) : 2.6 (0.1) 153 distinct values 0 (0%)
14 fta [numeric] Mean (sd) : 24 (2.6) min < med < max: 16.6 < 23.9 < 31.6 IQR (CV) : 3.3 (0.1) 196 distinct values 0 (0%)
15 oreb [numeric] Mean (sd) : 11 (1.3) min < med < max: 7.6 < 10.9 < 14.6 IQR (CV) : 1.7 (0.1) 113 distinct values 0 (0%)
16 dreb [numeric] Mean (sd) : 31.5 (2.1) min < med < max: 26.9 < 31.2 < 40.5 IQR (CV) : 3 (0.1) 159 distinct values 0 (0%)
17 treb [numeric] Mean (sd) : 42.4 (2) min < med < max: 36.8 < 42.2 < 49.7 IQR (CV) : 2.7 (0) 154 distinct values 0 (0%)
18 ast [numeric] Mean (sd) : 21.9 (2) min < med < max: 17.4 < 21.6 < 30.4 IQR (CV) : 2.6 (0.1) 157 distinct values 0 (0%)
19 tov [numeric] Mean (sd) : 14.4 (1.1) min < med < max: 11.2 < 14.4 < 17.7 IQR (CV) : 1.4 (0.1) 106 distinct values 0 (0%)
20 stl [numeric] Mean (sd) : 7.5 (0.9) min < med < max: 5.5 < 7.5 < 10 IQR (CV) : 1.1 (0.1) 81 distinct values 0 (0%)
21 blk [numeric] Mean (sd) : 4.9 (0.8) min < med < max: 2.4 < 4.8 < 8.2 IQR (CV) : 1 (0.2) 78 distinct values 0 (0%)
22 blka [numeric] Mean (sd) : 4.9 (0.7) min < med < max: 3 < 4.9 < 6.9 IQR (CV) : 0.9 (0.1) 71 distinct values 0 (0%)
23 pf [numeric] Mean (sd) : 20.9 (1.7) min < med < max: 16.6 < 20.8 < 26.7 IQR (CV) : 2.4 (0.1) 137 distinct values 0 (0%)
24 pts [numeric] Mean (sd) : 100.5 (5.9) min < med < max: 85.5 < 99.7 < 118.2 IQR (CV) : 7.6 (0.1) 296 distinct values 0 (0%)
25 pfd [numeric] Mean (sd) : 19.5 (5.1) min < med < max: 0 < 20.4 < 25.6 IQR (CV) : 2.2 (0.3) 119 distinct values 32 (6.68%)
26 pctAST [numeric] Mean (sd) : 0.6 (0) min < med < max: 0.5 < 0.6 < 0.7 IQR (CV) : 0.1 (0.1) 237 distinct values 0 (0%)
27 pctOREB [numeric] Mean (sd) : 0.3 (0) min < med < max: 0.2 < 0.3 < 0.4 IQR (CV) : 0 (0.1) 191 distinct values 0 (0%)
28 pctDREB [numeric] Mean (sd) : 0.7 (0) min < med < max: 0.7 < 0.7 < 0.8 IQR (CV) : 0 (0) 174 distinct values 0 (0%)
29 pctTREB [numeric] Mean (sd) : 0.5 (0) min < med < max: 0.5 < 0.5 < 0.5 IQR (CV) : 0 (0) 119 distinct values 0 (0%)
30 pctTOVTeam [numeric] Mean (sd) : 0.2 (0) min < med < max: 0.1 < 0.2 < 0.2 IQR (CV) : 0 (0.1) 112 distinct values 0 (0%)
31 pctEFG [numeric] Mean (sd) : 0.5 (0) min < med < max: 0.4 < 0.5 < 0.6 IQR (CV) : 0 (0) 172 distinct values 0 (0%)
32 pctTS [numeric] Mean (sd) : 0.5 (0) min < med < max: 0.5 < 0.5 < 0.6 IQR (CV) : 0 (0) 151 distinct values 0 (0%)
33 ortgE [numeric] Mean (sd) : 104.1 (3.7) min < med < max: 92.3 < 103.9 < 113.9 IQR (CV) : 5.4 (0) 227 distinct values 0 (0%)
34 ortg [numeric] Mean (sd) : 105.7 (3.7) min < med < max: 94.4 < 105.3 < 114.9 IQR (CV) : 5.1 (0) 232 distinct values 0 (0%)
35 drtgE [numeric] Mean (sd) : 104.1 (3.6) min < med < max: 91.6 < 104.2 < 115.1 IQR (CV) : 5.1 (0) 229 distinct values 0 (0%)
36 drtg [numeric] Mean (sd) : 105.7 (3.5) min < med < max: 93.1 < 105.8 < 116.8 IQR (CV) : 4.9 (0) 223 distinct values 0 (0%)
37 netrtgE [numeric] Mean (sd) : 0 (5) min < med < max: -15.5 < 0 < 12.1 IQR (CV) : 7 (672.1) 274 distinct values 0 (0%)
38 netrtg [numeric] Mean (sd) : 0 (4.7) min < med < max: -15.1 < 0.1 < 11.4 IQR (CV) : 6.8 (420.7) 269 distinct values 0 (0%)
39 ratioASTtoTO [numeric] Mean (sd) : 1.5 (0.2) min < med < max: 1 < 1.5 < 2.1 IQR (CV) : 0.3 (0.1) 151 distinct values 0 (0%)
40 ratioAST [numeric] Mean (sd) : 16.8 (1.2) min < med < max: 14.1 < 16.7 < 21.2 IQR (CV) : 1.5 (0.1) 106 distinct values 0 (0%)
41 paceE [numeric] Mean (sd) : 95.7 (3.5) min < med < max: 88.6 < 95.3 < 106.5 IQR (CV) : 4.9 (0) 227 distinct values 0 (0%)
42 pace [numeric] Mean (sd) : 94.3 (3.4) min < med < max: 87.4 < 93.9 < 104.6 IQR (CV) : 4.8 (0) 432 distinct values 0 (0%)
43 ratioPIE [numeric] Mean (sd) : 0.5 (0) min < med < max: 0.4 < 0.5 < 0.6 IQR (CV) : 0 (0.1) 211 distinct values 0 (0%)
44 year [integer] Mean (sd) : 2011.5 (4.6) min < med < max: 2004 < 2012 < 2019 IQR (CV) : 7.5 (0) 16 distinct values 0 (0%)

Team_shooting contains the shooting performance of each team from different regions of the court. We cleaned it the same way as Team_splits.

Scroll down the table to see more details

print(dfSummary(Team_shooting,headings = FALSE,plain.ascii = FALSE,valid.col = FALSE,graph.magnif = 0.75,style = "grid"),max.tbl.height = 500,method='render')
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 team [character] 1. Atlanta Hawks: 80 (3.3%); 2. Boston Celtics: 80 (3.3%); 3. Chicago Bulls: 80 (3.3%); 4. Cleveland Cavaliers: 80 (3.3%); 5. Dallas Mavericks: 80 (3.3%); 6. Denver Nuggets: 80 (3.3%); 7. Detroit Pistons: 80 (3.3%); 8. Golden State Warriors: 80 (3.3%); 9. Houston Rockets: 80 (3.3%); 10. Indiana Pacers: 80 (3.3%); [ 26 others ]: 1595 (66.6%) 0 (0%)
2 distance [character] 1. 16-24 ft.: 479 (20.0%); 2. 24+ ft.: 479 (20.0%); 3. 8-16 ft.: 479 (20.0%); 4. Back Court Shot: 479 (20.0%); 5. Less Than 8 ft.: 479 (20.0%) 0 (0%)
3 fgm [numeric] Mean (sd) : 607.1 (528.8) min < med < max: 0 < 474 < 2259 IQR (CV) : 467.5 (0.9) 939 distinct values 0 (0%)
4 fga [numeric] Mean (sd) : 1335.9 (949.2) min < med < max: 3 < 1230 < 3891 IQR (CV) : 1225 (0.7) 1309 distinct values 0 (0%)
5 pctFG [numeric] Mean (sd) : 0.3 (0.2) min < med < max: 0 < 0.4 < 0.6 IQR (CV) : 0.1 (0.5) 293 distinct values 0 (0%)
6 year [integer] Mean (sd) : 2011.5 (4.6) min < med < max: 2004 < 2012 < 2019 IQR (CV) : 8 (0) 16 distinct values 0 (0%)



IV. Missing values


As the tables above show, Team_shooting has no missing values and Team_splits has almost none (only pfd is missing, in 32 rows). Since Player and Players_bio are similar to each other, we display only the missing values of Players_bio here.

visna(Players_bio)
Figure 1: Missing values


Figure 1 shows that the majority of the data has no missing values.

Rows that are missing the year_start variable are also missing all the variables that follow it. This is because these columns come from another table, bio. Although the bio table itself has no missing values, it does not contain every player that the Player data has.
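This pattern is what a left join produces when the right-hand table lacks some keys. The sketch below reproduces it on toy data, assuming the join is on the player name. The actual join key is not stated here, and the values are invented.

```r
# Toy tables: player B has game stats but no biographical record.
players <- data.frame(Player = c("A", "B", "C"), PTS = c(10, 5, 8),
                      stringsAsFactors = FALSE)
bio <- data.frame(Player = c("A", "C"), year_start = c(2005L, 2012L),
                  stringsAsFactors = FALSE)

# all.x = TRUE keeps every player row; unmatched bio columns become NA.
joined <- merge(players, bio, by = "Player", all.x = TRUE)
print(joined)
```

Every column that comes from `bio` is missing in exactly the same rows, which explains the block-shaped pattern in the missing-value plot.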

We can also see quite a few rows with missing 3P%, FT%, etc. These variables are players’ shooting percentages per season. A missing value means that the player did not attempt that kind of shot that season, so the percentage is undefined.
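The arithmetic behind these gaps is a division by zero attempts. In the toy example below (invented numbers), a player with attempts but no makes gets 0, while a player with no attempts gets an undefined value that is best stored as NA.

```r
made      <- c(3, 0, 0)
attempted <- c(8, 4, 0)

# Naive division: 0 / 0 evaluates to NaN in R.
pct <- made / attempted

# Explicit version: record NA whenever there were no attempts.
pct_clean <- ifelse(attempted > 0, made / attempted, NA_real_)

print(pct)
print(pct_clean)
```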



V. Results


Overall Offensive Performance

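# Packages used by the plotting code in this section
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)

Team_splits %>% select(year, pts, pace) %>% group_by(year) %>% summarise(Pace = mean(pace), Points = mean(pts)) %>%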
  gather(key = 'type', value = 'value', -year) %>%
  ggplot(aes(x = year, y = value)) +
  geom_line(color = '#C9082A', size = 2) +
  geom_point(color = '#C9082A', size = 4) +
  geom_point(color = '#FFFFFF', size = 2) +
  #scale_color_manual(values = c('#17408B', '#C9082A')) +
  facet_grid(type ~ ., scales = 'free_y') +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  xlab('') +
  ylab('') +
  ggtitle('Pace and Points Per Game') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 10), 
        axis.text.y = element_text(color = "#000000", size = 10), 
        strip.text.y = element_text(size = 10, colour = '#FFFFFF', face = 'bold'),
        strip.background = element_rect(fill = '#17408B', colour = 'white'),
        plot.title = element_text(size = 17.5, face = 'bold'),
        legend.position = 'none')

There is an obvious trend in both Pace (the number of possessions a team uses per game) and PPG (Points Per Game) in NBA games over the most recent 15 years. From 2004 to 2013, pace and PPG fluctuate around 93 and 98 respectively, but from 2014 both stats keep growing. In 2019 in particular, pace rises to 101 from 98 the year before, and PPG increases by nearly 6 points over the previous season. It is easy to see a positive association between pace and PPG, since the more possessions a team has, the more chances it has to score.

Team_splits %>% select(year, ortg, pctWins) %>% group_by(year) %>%
  ggplot(aes(x = year, y=ortg, alpha = pctWins, color = 1-pctWins)) +
  geom_jitter(size = 2) +
  geom_smooth(linetype = 'longdash', colour = '#C9082A', se = FALSE, size = 2) +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  ggtitle('Average Offensive Rating Per Game') +
  xlab('') +
  ylab('') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 10),
        axis.text.y = element_text(color = "#000000", size = 10),
        plot.title = element_text(size = 17.5, face = 'bold'),
        legend.position = 'none') 

This plot shows the average ortg (offensive rating, a statistic that measures a team’s offensive performance as points scored per 100 possessions) of each team over these 15 years. Color reflects each team’s win percentage: the darker the marker, the more the team wins. Offensive rating shows that teams’ offensive ability started growing in 2013 and reached an unprecedented level in 2018. This raises a question: besides the higher pace, are there other reasons for such strong offensive performance in recent years?
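One way to frame this question is to split scoring into volume and efficiency. Since offensive rating is points per 100 possessions, points per game is approximately pace times offensive rating divided by 100. The sketch below applies this to the mean pace (94.3) and mean ortg (105.7) from the Team_splits summary. It is only an approximation, and `points_per_game` is a hypothetical helper.

```r
# Approximate points per game from possessions (pace) and efficiency (ortg).
points_per_game <- function(pace, ortg) pace * ortg / 100

# Mean pace and offensive rating taken from the Team_splits summary table.
est <- points_per_game(pace = 94.3, ortg = 105.7)
print(est)
```

The estimate is close to, but not identical with, the mean pts of 100.5 in the same table. Under this framing, a rise in PPG can come from more possessions, from more points per possession, or from both.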

Scoring

p1 <- Team_splits %>% select(year, fg3a, fg2a) %>%
  gather(key = 'type', value = 'attempt', -c(year)) %>%
  group_by(year, type) %>% summarise(attempt = mean(attempt)) %>%
  ggplot(aes(x = year, y = attempt, group = year)) +
  #geom_boxplot(aes(color = type)) +
  #geom_line() +
  geom_bar(stat = 'identity', fill = '#C9082A') +
  facet_grid(type ~., scales = 'free_y', labeller = as_labeller(c(`fg2a` = '2 pointer', `fg3a` = '3 pointer'))) +
  scale_color_manual(values = c('#17408B', '#C9082A')) +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  xlab('') +
  ylab('') +
  #ylim(0, 2500) +
  ggtitle('Field Goals Attempt') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, vjust = 0.5,color = "#000000", size = 10),
        axis.text.y = element_text(color = "#000000", size = 10),
        strip.text.y = element_text(size = 10, colour = '#FFFFFF', face = 'bold'),
        strip.background = element_rect(fill = '#17408B', colour = 'white'),
        plot.title = element_text(size = 17.5, face = 'bold'),
        legend.position = 'none') 

p2 <- Team_splits %>% select(year, pctFG3, pctFG2) %>% 
  gather(key = 'type', value = 'percentage', -c(year)) %>%
  group_by(year, type) %>% summarise(percentage = mean(percentage)) %>%
  ggplot(aes(x = year, y = percentage)) +
  geom_line(color = '#C9082A', size = 2) +
  geom_point(color = '#C9082A', size = 4) +
  geom_point(color = '#FFFFFF', size = 2) +
  facet_grid(type ~., scales = 'free_y', labeller = as_labeller(c(`pctFG2` = '2 pointer', `pctFG3` = '3 pointer'))) +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  xlab('') +
  ylab('') +
  ggtitle('Field Goals Percentage') +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 60, vjust = 0.5,color = "#000000", size = 10),
        axis.text.y = element_text(color = "#000000", size = 10),
        strip.text.y = element_text(size = 10, colour = '#FFFFFF', face = 'bold'),
        strip.background = element_rect(fill = '#17408B', colour = 'white'),
        plot.title = element_text(size = 17.5, face = 'bold'))

grid.arrange(p1, p2, ncol = 2)

In basketball, a field goal is a basket scored on any shot or tap other than a free throw, worth two or three points depending on the distance of the attempt from the basket.

This plot shows the league-average FGA (Field Goal Attempts) and FG% (Field Goal Percentage) for both 2-pointers and 3-pointers.

In the left plot, we find that NBA teams are attempting more and more 3-pointers year by year without much of a decrease in 2-pointer attempts. In 2019, FGA for 3-pointers is more than twice what it was 15 years earlier. Also in 2019, FGA for 3-pointers is above 30 and FGA for 2-pointers is below 60, which means that on average one in every three shots in an NBA game is a 3-pointer.
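The "one in three" figure is the share of 3-point attempts among all field goal attempts. A small sketch follows, using the rounded thresholds of 30 and 60 quoted in the text rather than exact values from the data. The function name is hypothetical.

```r
# Share of all field goal attempts that are 3-pointers.
three_point_share <- function(fg3a, fg2a) fg3a / (fg3a + fg2a)

share_2019 <- three_point_share(fg3a = 30, fg2a = 60)
print(share_2019)
```

With more than 30 attempts from three and fewer than 60 from two, the actual share is at least one third.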

The right plot shows the FG% of 2-pointers and 3-pointers from 2004 to 2019. The FG% for 2-pointers clearly keeps growing from 2012 and has stayed above 50% since 2017. The FG% for 3-pointers fluctuates between 35% and 36% in most years. Teams appear to be making their 2-point shots more efficient by raising their FG%.

From these two plots, we can see that NBA teams’ strategy for scoring more is to attempt more 3-pointers and make their 2-point shots more efficient.
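A rough way to see why this strategy pays off is to compare expected points per attempt, which is the shot value times the probability of making it. The sketch below uses the approximate percentages quoted above (about 35.5% on 3-pointers and about 50% on 2-pointers). These are rounded figures from the text, not exact values from the data.

```r
# Expected points from one attempt: make probability times shot value.
expected_points <- function(pct, value) pct * value

ep3 <- expected_points(pct = 0.355, value = 3)  # 3-pointer at roughly 35-36%
ep2 <- expected_points(pct = 0.50,  value = 2)  # 2-pointer at roughly 50%

print(c(three = ep3, two = ep2))
```

Under these assumptions a 3-point attempt is worth slightly more than an average 2-point attempt, so shifting attempts toward threes while raising 2-point efficiency increases scoring on both fronts.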

Team_shooting$distance <- factor(Team_shooting$distance, levels = unique(Team_shooting$distance))

p1 <- Team_shooting %>% filter(distance != 'Back Court Shot') %>% select(distance, fga, year) %>% group_by(year, distance) %>% summarise_all(mean) %>%
  ggplot(aes(x = year, y = fga/82, group = year)) +
  #geom_boxplot() +
  geom_bar(stat = 'identity', fill = '#C9082A') +
  facet_grid(distance ~ ., scales = 'free_y') +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  xlab('') +
  ylab('') +
  #ylim(0,1500) +
  ggtitle('Field Goals Attempt by Distance') +
  theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 10), 
          axis.text.y = element_text(color = "#000000", size = 10),
          legend.position = 'none',
          strip.text.y = element_text(size = 10, colour = '#FFFFFF', face = 'bold'),
          strip.background = element_rect(fill = '#17408B', colour = 'white'),
          plot.title = element_text(size = 17.5, face = 'bold'))

p2 <- Team_shooting %>% filter(distance != 'Back Court Shot') %>% select(distance, pctFG, year) %>% group_by(year, distance) %>% summarise_all(mean) %>%
  ggplot(aes(x = year, y = pctFG)) +
  geom_line(color = '#C9082A', size = 2) +
  geom_point(color = '#C9082A', size = 4) +
  geom_point(color = '#FFFFFF', size = 2) +
  facet_grid(distance ~ ., scales = 'free_y') +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  xlab('') +
  ylab('') +
  ggtitle('Field Goals Percentage by Distance') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 10), 
        axis.text.y = element_text(color = "#000000", size = 10),
        legend.position = 'none',
        strip.text.y = element_text(size = 10, colour = '#FFFFFF', face = 'bold'),
        strip.background = element_rect(fill = '#17408B', colour = 'white'),
        plot.title = element_text(size = 17.5, face = 'bold'))

grid.arrange(p1, p2, ncol = 2)

This plot shows FGA and FG% of shots from different regions of the court. The distance is how far the shooting spot is from the basket. Shots from beyond 23 feet 9 inches (22 feet in the corners) are 3-pointers and the others are 2-pointers. The 24+ ft data are similar to those of the 3-pointers in the plot above.

This plot decomposes 2-point shots into three types: ‘near basket’, ‘mid-range’ and ‘long-range’.

We can see from the left plot that ‘near basket’ 2-pointers have the highest FGA of all, reaching 30 in 2019, which is even more than the sum of the other two types. Meanwhile, ‘long-range’ attempts keep going down and ‘mid-range’ attempts remain around 12. Since the difficulty of making a field goal rises with the distance from the basket while both kinds of shot are worth two points, ‘long-range’ 2-pointers seem less valuable than ‘near basket’ ones. In the right plot, we can see that the FG% of ‘near basket’ shots is far above the others and reached 58% in 2019, while the FG% of ‘mid-range’ shots also keeps rising.

This may explain how NBA teams manage to keep attempting more 3-pointers while raising the FG% of their 2-pointers: they take fewer shots from ‘low-efficiency’ regions and focus more on shots near the basket.

Team_splits %>% select(year, pctTS, pctWins) %>% group_by(year) %>%
  ggplot(aes(x = year, y=pctTS, alpha = pctWins, color = 1-pctWins)) +
  geom_jitter(size = 2) +
  geom_smooth(linetype = 'longdash', colour = '#C9082A', se = FALSE, size = 2) +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  ggtitle('Average True Shooting Percentage Per Game') +
  xlab('') +
  ylab('') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 10),
        axis.text.y = element_text(color = "#000000", size = 10),
        plot.title = element_text(size = 17.5, face = 'bold'),
        legend.position = 'none') 

TS% (True Shooting Percentage) measures shooting efficiency. It combines 2-point field goals, 3-point field goals and free throws into a single figure instead of treating field goal percentage, three-point percentage and free throw percentage separately, which gives a more accurate picture of shooting. As before, the darker the marker, the more the team wins. The TS% curve has a similar shape to the offrtg curve, and teams today shoot much more efficiently than 15 years ago.
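
The pctTS column is used here as provided by the data source, so its exact calculation is not shown in this report. As a minimal sketch, the commonly cited definition is TS% = PTS / (2 × (FGA + 0.44 × FTA)). The function name and the example game line below are hypothetical.

```r
# Commonly cited true shooting formula; assumed here, not taken from this report's code
true_shooting <- function(pts, fga, fta) {
  pts / (2 * (fga + 0.44 * fta))
}

# Hypothetical team game line: 110 points on 88 field goal attempts and 25 free throw attempts
ts_example <- true_shooting(pts = 110, fga = 88, fta = 25)
round(ts_example, 3)  # approx. 0.556
```

The 0.44 factor is a conventional estimate of how many free throw attempts use up a possession.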

Sharing the ball

Teamwork is very important in basketball because it lets the team function as a unit rather than as individuals. On offense, teamwork is vital because it keeps the defense guessing about who will take the shot and where it will come from. If only one person takes the shots for the team, the defense can concentrate its efforts on stopping that scorer.

Team_splits %>% select(year, ast, tov) %>% group_by(year) %>% summarise_all(mean) %>%
  gather(key = 'type', value = 'value', -year) %>%
  ggplot(aes(x = year, y=value)) +
  geom_line(color = '#C9082A', size = 2) +
  geom_point(color = '#C9082A', size = 4) +
  geom_point(color = '#FFFFFF', size = 2) +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  ggtitle('Average Assists and Turnovers Per Game') +
  facet_grid(type ~ ., scales = 'free_y') +
  xlab('') +
  ylab('') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 10), 
        axis.text.y = element_text(color = "#000000", size = 10),
        legend.position = 'none',
        strip.text.y = element_text(size = 10, colour = '#FFFFFF', face = 'bold'),
        strip.background = element_rect(fill = '#17408B', colour = 'white'),
        plot.title = element_text(size = 17.5, face = 'bold'))

Ast (Assist, credited to a player who passes the ball to a teammate in a way that leads to a field goal) roughly measures a team’s willingness and ability to share the ball. Tov (Turnover, which occurs when a team loses possession of the ball to the opposing team before a player takes a shot at the basket) gives an angle on how disciplined the team is.

In this plot, we can see that assists rose dramatically in 2013 and have kept growing since. Turnovers rose until 2014 and have been dropping since then. It’s likely that in 2013 and 2014 teams started to speed up and encourage passing before players had adjusted to this style, so a lot of passes turned into turnovers. From 2015, teams began to figure out how to get the ball to the scorer and reduce bad passes.

Team_splits %>% select(year, ratioAST, pctWins) %>% group_by(year) %>%
  ggplot(aes(x = year, y=ratioAST, alpha = pctWins, color = 1-pctWins)) +
  geom_jitter(size = 2) +
  geom_smooth(linetype = 'longdash', colour = '#C9082A', se = FALSE, size = 2) +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  ggtitle('Assist Ratio Per Game') +
  xlab('') +
  ylab('') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 10), 
        axis.text.y = element_text(color = "#000000", size = 10),
        legend.position = 'none',
        plot.title = element_text(size = 17.5, face = 'bold'))

Assist Ratio is the percentage of a team’s possessions that end in an assist. The growth is obvious over the most recent three years, and the teams that are best at sharing the ball in these years all post outstanding results.
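
Following the definition above, Assist Ratio can be sketched as assists per 100 possessions. The ratioAST column is used here as provided by the data source; the toy data frame and its poss column below are hypothetical and only illustrate the arithmetic.

```r
# Hypothetical per-game team figures (not from Team_splits)
toy <- data.frame(
  team = c("A", "B"),
  ast  = c(27, 21),    # assists
  poss = c(100, 100)   # possessions
)

# Percentage of possessions that end in an assist
toy$ratioAST <- 100 * toy$ast / toy$poss
toy
```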

Height/Weight Ratio

data <- Players_bio%>%
  filter(Year>=2004)%>%
  select(Player,height,weight,Year)%>%
  distinct()%>%
  drop_na()%>%
  as.data.frame(stringsAsFactors = FALSE)%>%
  select(height,weight,Year)%>%
  dplyr::group_by(Year)%>%
  dplyr::summarise(avg_h=mean(height),avg_w=mean(weight))%>%
  dplyr::ungroup()%>%
  mutate(hw_ratio=avg_h/avg_w)%>%
  select(Year,hw_ratio)


ggplot()+
  geom_line(aes(x=Year,y=hw_ratio),data=data,color="#C9082A",size=2)+
  geom_point(aes(x=Year,y=hw_ratio),data=data,color="#C9082A",size=4)+
  geom_point(aes(x=Year,y=hw_ratio),data=data,color="white",size=2)+
  scale_x_continuous(breaks=seq(2004,2019))+
  scale_y_continuous(labels = scales::percent_format(accuracy = 1))+
  xlab('') +
  ylab('') +
  theme_minimal()+
  ggtitle("Average Height/Weight Ratio Per Season")+
  theme(plot.title = element_text(size=17.5,face="bold"),
        axis.text.x = element_text(angle = 45, hjust = 1,color = "#000000", size = 10),
        axis.text.y = element_text(color = "#000000", size = 10))

VI. Interactive Plot




VII. Conclusion




 

A work by Chao Yin & Zeyu Yang

cy2507@columbia.edu | zy2327@columbia.edu